Introduction to Statistics for Astronomers and Physicists

Section 0: Course Format and Outline, and a Crash Course in R and Python

Dr Angus H Wright

2022-02-09

Welcome and Introduction

Welcome to the Introduction to Statistics for Astronomers and Physicists course for the Summer Semester 2021.

Course Philosophy

This course is designed to be a practical introduction to statistics for astronomers and physicists who are starting their research careers and have had little (or perhaps no) previous education in statistics and statistical data analysis. The course (and these lecture notes) are not designed to be a statistics reference text. Rather the material presented here is designed to guide students on a suitable path towards robust data analysis and research.

The course will present many aspects of data analysis that are widely relevant to modern astronomy and physics. We will borrow heavily from standard statistical problems and thought experiments in an effort to convey important points, and demonstrate common statistical and logical fallacies. Problems will almost always be explored using a mixture of tools simultaneously: plain English, math, computer code, graphs, and more.

Course Outline

This course will be taught in 4 parts, each spanning 2-4 weeks.

Section 1: Data Description, Analysis, and Modelling (Weeks 1-2)

When working in empirical science, modelling and understanding datasets is paramount. In this module we discuss the fundamentals of data modelling, starting with theories of point and interval estimation, in the context of summary statistics (e.g. expectation values, confidence intervals), and estimation of data correlation and covariance. Students will learn the fundamentals of data mining and analysis, in a way that is applicable to all physical sciences.

Topics include:

  • Types of data
  • Point & interval estimation
  • Correlation & covariance
  • Fundamentals of data exploration/mining
  • Introduction to data visualisation

Section 2: Probability & Decision Making (Weeks 3-5)

For all aspects of modern science, an understanding of probability is required. We cover a range of topics in probability, from decision theory and the fundamentals of probability theory, to standard probabilistic distributions and their origin. From this module, students will gain an insight into different statistical distributions that govern modern observational sciences, the interpretation of these distributions, and how one accurately models distributions of data in an unbiased manner.

Topics include:

  • Decision theory
  • Fundamentals of probability
  • Statistical distributions and their origins

Section 3: Bayesian Statistics (Weeks 6-8)

Bayes' theorem led to a revolution in statistics, via the concepts of prior and posterior evidence. In modern astronomy and physics, applications of Bayesian statistics are widespread. We begin with a study of Bayes' theorem, and the fundamental differences between frequentist and Bayesian analysis. We then explore applications of Bayesian statistics, through well-studied statistical problems (both within and outside of physics).

Topics include:

  • Frequentist & Bayesian statistics
  • Bayes' theorem
  • Prior specification
  • Hypothesis testing

Section 4: Parameter Simulation, Optimisation, and Analysis (Weeks 9-12)

We apply our understanding of Bayesian statistics to the common problems of parameter simulation, optimisation, and inference. Students will learn the fundamentals of Monte Carlo Simulation, Markov Chain Monte Carlo (MCMC) analysis, hypothesis testing, and quantifying goodness-of-fit. We discuss common errors in parameter inference, including standard physical and astrophysical biases that corrupt statistical analyses.

Topics include:

  • Monte Carlo Simulation
  • Markov-Chain Monte Carlo
  • Fitting high dimensional data
  • Machine Learning

Rmarkdown

Slides and lecture notes for this course are prepared in Rmarkdown, and provided to you after the lectures.

The utility of Rmarkdown is that it allows execution of code chunks alongside markdown-style text, in a wide array of languages, including R, python, bash, Rcpp, javascript, and more. This allows us to present examples in multiple languages easily within one document. For example, if I want to plot a function, I can do so:
#in R
library(magicaxis) #provides the magplot function
theta=seq(0,2,by=0.01)
sinu=1+sin(2*pi*theta)
magplot(theta,sinu,type='l')
#or in python
import numpy as np
import matplotlib.pyplot as plt
theta=np.arange(0.,2.,0.01)
sinu=1+np.sin(2*np.pi*theta)
plt.plot(theta,sinu)
plt.show()
Information generated and stored within blocks is persistent, and code-blocks with different engines can also cross-communicate. This means that, for example, we can:
#Create some data in R
#E.g. random draws from f~N(0,1)
x=rnorm(1e3) 
y=rnorm(1e3) 
#Default plot in R
plot(x,y)
#and access it directly in python
plt.scatter(r.x,r.y)
plt.show()

Slido

In addition to the lecture notes and slides, all students can post on-the-fly questions about the materials via slido. To access slido, you simply scan the QR code with your smartphone camera or join at slido.com using the event ID listed in the window.

Once you are connected to the event, you can post your questions whenever you like. Please ask as many questions as you want! This works best when you ask as many questions as possible, because:

  • Questions can be asked anonymously, so you can ask freely and without anxiety.
  • All questions are good questions!!
  • Once a question is registered, everyone is able to add support by clicking the “thumbs up”.
    • Think of this as being a way of saying “Oh! I want to know the answer to that too!”.
    • Questions are ranked by popularity, so vote for questions you want to have answered too!
    • I’ll try to answer all questions, but the most popular questions will float to the top and get answered quickly!
  • Most importantly: I will be able to use questions that were asked and/or were popular to update the lecture notes
    • Course notes will better cover things you don’t understand
    • Course will be more relevant to you
    • The course will improve over time!

So ask whenever and about whatever you think is relevant/interesting/unclear. The more the better!

Statistics and Computing

This is a lecture course on Statistics and Statistical methods; so why do we care about computer code?!

It is of course possible to discuss statistics entirely on-paper. However, using computer code allows us to explore practical statistics in real time, rather than limiting ourselves purely to equations and diagrams (full disclosure: we will use these a lot in this course too!).

Additionally (and probably more importantly) the goal of this course is for you to learn how to apply statistics to the problems that you will encounter over the course of your academic careers; and to do so properly. Therefore, an entirely on-paper statistics course isn’t likely to be overly useful. My goal is for you to be able to leave this course with the skills to actually apply these concepts in real-life applications without difficulty.

Why do we need programming?

Modern physics and astronomy requires an understanding of programming. From theoreticians writing models to experimentalists writing analysis pipelines, most physicists and astronomers will read, write, or use a computer program every day.

An excellent example of this is the N-body simulation. In 1941, Erik Holmberg performed the first simulations of colliding galaxies, 20 years prior to a famous work by Sebastian von Hoerner that established the field (and name) N-body Simulations.

von Hoerner 1960

Holmberg 1941

Holmberg’s work was exceptional for a number of reasons, but has become famous because of how it was completed. Holmberg simulated the collisions of rotating spiral galaxies:

Holmberg 1941

And generated tidal disruption features that are now seen commonly in merging spiral galaxies:

merger simulation

The surprise? His work was computed entirely by hand. Holmberg used arrangements of lightbulbs to simulate groups of stars, and photometers to compute the gravitational pull of all mass-elements on each-other per unit time.

Holmberg bulb arrangement

Of course nowadays we can run a simulation like this in seconds on any laptop (or smartphone, if you really want to!). This allows measurements to be more accurate, more detailed, and more reproducible. All of these are fundamental to modern natural science.
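To underline the point, here is a minimal sketch of such a simulation in python. This is not Holmberg's method, just an illustrative toy: the function name, the gravitational softening, and the G=1 units are all invented for this example.

```python
import numpy as np

def nbody_step(pos, vel, mass, dt=0.01, soft=0.1):
    """One step of a toy N-body integration (G = 1 units).
    pos, vel: (N, 2) arrays; mass: (N,) array. The softening
    length avoids singular forces at small separations."""
    # Pairwise separation vectors: diff[i, j] = pos[j] - pos[i]
    diff = pos[None, :, :] - pos[:, None, :]
    dist3 = (np.sum(diff**2, axis=-1) + soft**2) ** 1.5
    # Acceleration on each particle from all the others
    acc = np.sum(mass[None, :, None] * diff / dist3[:, :, None], axis=1)
    vel = vel + acc * dt
    pos = pos + vel * dt
    return pos, vel

# Two equal point masses, started at rest, fall towards each other
pos = np.array([[-1.0, 0.0], [1.0, 0.0]])
vel = np.zeros_like(pos)
mass = np.ones(2)
for _ in range(10):
    pos, vel = nbody_step(pos, vel, mass)
```

A full galaxy-collision run is just this loop with many more particles and steps; the pairwise-force sum is exactly the quantity Holmberg measured with his photometers.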

Modern Science and the Requirement of Programming

The ubiquitousness of programming in the modern physical sciences is linked to the important role that statistics plays in these fields. Modern science is increasingly reliant on large and/or complex datasets, which fundamentally must be analysed using modern statistical methods. A classic example is model optimisation (which we will cover in detail in this course): let us take a relatively simple dataset that we know follows a generic beta distribution, and attempt to model this dataset using a function containing 2 parameters:

\[\begin{multline} Y = Beta(X, \alpha,\beta) + {\rm noise} \\ 0\leq X \leq 1 \\ \alpha, \beta \in [0,\infty) \end{multline}\]
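A dataset of this form can be generated in a few lines of python. The true parameter values and noise level below are invented for illustration, not the ones used to make the lecture figures:

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(42)

# Illustrative "true" parameters of the beta distribution
alpha_true, beta_true = 3.2, 7.7

# x uniform on [0, 1]; y = Beta pdf plus Gaussian noise
x = rng.uniform(0.0, 1.0, 1000)
y = scipy.stats.beta.pdf(x, alpha_true, beta_true) + rng.normal(0.0, 0.1, 1000)
```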

One might be inclined to attempt to fit a model to these data by-hand, using trial and error:

Using this approach we can get a reasonable fit with \(\alpha=2.8,\beta=6.7\):

But is this solution near the truth? How close is good enough? And what are the uncertainties on the parameters? These are all important in modern science, and they are precisely the sort of questions/problems that computers/programs are designed to tackle. With just one command, we can use R/python to reach a more accurate solution, in a fraction of the time.

#Estimate parameters using Nonlinear Least Squares (nls) in R 
fit_R=nls(y~dbeta(x,alpha,beta), #The function to fit
          data=list(x=x,y=y),    #The data 
          start=list(alpha=2,beta=5), #The initial guess
          algorithm='port',      #The algorithm 
          lower=c(0,0))          #The lower bounds
best_R=summary(fit_R)$parameters
#Estimate parameters using scipy.optimize.curve_fit in python
import scipy.optimize
import scipy.stats
best_py, cov_py = scipy.optimize.curve_fit(
         scipy.stats.beta.pdf, #The function to fit
         r.x, r.y,         #The data
         p0=[2,5],         #The initial guess 
         bounds=(0,np.inf),#The lower and upper bounds
         method='trf')     #The fitting algorithm

Obviously these fits are superior to those which we can reach by-hand in terms of accuracy, effort, and runtime. But the most important benefit is in terms of uncertainty estimation. Statistical computing is important for a wide range of reasons, but arguably the first and most important reason is for the computation of measures of uncertainty.

#Model statistics in R
summary(fit_R)
## 
## Formula: y ~ dbeta(x, alpha, beta)
## 
## Parameters:
##       Estimate Std. Error t value Pr(>|t|)    
## alpha  3.21166    0.01453   221.0   <2e-16 ***
## beta   7.70586    0.03728   206.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09854 on 998 degrees of freedom
## 
## Algorithm "port", convergence message: both X-convergence and relative convergence (5)

And the equivalent in python:

#Model parameters and covariance in python
print(best_py,cov_py) 
## [3.21166224 7.70585894] [[0.00021118 0.00050005]
##  [0.00050005 0.00138966]]
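The two outputs agree: the standard errors quoted in R's summary() are just the square roots of the diagonal of the python covariance matrix. A quick check, with the covariance values copied from the printed output above:

```python
import numpy as np

# Parameter covariance matrix as printed by curve_fit above
cov_py = np.array([[0.00021118, 0.00050005],
                   [0.00050005, 0.00138966]])

# Standard errors = square roots of the diagonal elements
stderr = np.sqrt(np.diag(cov_py))
print(stderr)  # matches the Std. Error column from R: 0.01453, 0.03728
```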

Do I need to know R or Python?

Despite what the internet will tell you, today there is very little separating the two languages in terms of functionality. Both languages can be run as a subprocess of the other, both have well developed tools for data analysis and machine learning, and both have a wealth of tutorials and guides to help new users enter the game.

  • Overall, there is one major consideration that will (and should) drive your choice of which language to learn first:

  • What languages do your colleagues use most?

Focus on whichever suits you best (and understand both if you can)

Any perceived benefit or detriment of the languages will invariably be overwhelmed by whether or not you are able to share and discuss code together with your colleagues. So, this is mostly a case where joining the herd is probably the sensible choice.

What’s the difference between R and Python?

R

R was originally developed as a statistics and data analysis language, and so many data analysis tools are available natively within base R. Additional tools are available through the Comprehensive R Archive Network (CRAN), which contains over 10k packages that are all required to be fully documented. This means that if you want to perform a particular flavour of statistical analysis, there is a good chance that well developed code already exists within R to do it (or at least to get you started).

Python

Python's strength lies in its use as a general programming language beyond data analysis. Packages are available to install via pip or conda, and are generally reliable despite (often) a lack of documentation.

What do I use?

Personally, I code primarily in R. This is useful for this course, as many of the analysis tools that we will use are available in base R. Nonetheless, as I said above, it is entirely possible to redo much of this analysis in python. Typically the only drawback to doing so is that the code is longer (most complex models in R can be specified in a few lines).

What will you see in this course?

In practice, the vast majority of examples in this course will be programmed in R.

However, I am happy to rewrite examples in python that you think would be particularly useful!

So please let me know which of the examples/problems that we explore during the course that you would like to see written in python, and I will add them to the course notes.

Finally, as you will likely see in the following sections, if you can understand one, then you can probably understand both. In this lecture we will go through some examples of R and python code, so that you have an introduction to the important parts (and can follow along without much trouble).

A Crash Course in R and Python

For the remainder of this chapter, we will be going through a crash course in R/python basics. There are many online tools that you can use to teach yourself both of these languages. In this course, you will be seeing a fair bit of R (in particular) but also python. If you are already familiar with python, then this section may be useful as a “Rosetta stone” of sorts. If you are unfamiliar with either language, then this section will hopefully give an introduction to the syntax/methods for using R and/or python.

A few important NBs:

  • Much of the information here focuses on simplicity rather than efficiency/elegance.

  • There may be other/better ways to perform the below operations.

  • This list of “good/important things to know” is certainly not exhaustive.

  • You can always learn more advanced operations in R and python via this very useful website.

Additionally, I am certainly not a python expert. I have tried to construct the below comparisons/conversions between R and python in the fairest possible manner. If you think that there is a simpler/more efficient/more elegant implementation of any python snippets below (or if something I’ve written is just plain wrong!) then please let me know and I will update the notes accordingly!

The following slides cover:

  • Variable assignment
  • Variable types
  • Data types
  • Code structure
  • Data indexing
  • Installing and loading libraries/packages
  • Reading and Writing data
  • Plotting
  • Interfacing between R and python

Installing and loading Libraries

The first important step in using R and python efficiently is to understand how to install and load libraries/packages. These are tools which have been written by a third party and made available for all users to access and use. Both R and python have a plethora of available packages. It is very rare to need to code up a statistical method yourself, as it has most likely already been written (with speed tricks and important checks/balances). Packages in R are installed from within the R session, whereas python packages are installed from the command line with a separate tool, “pip”:
#within the R session
#Install a packages called "remotes"
install.packages("remotes") 
#from the _commandline_
#install the numpy package using pip 
pip install numpy
Loading these packages into R and python is something we've already seen a few times:
#in R: load the remotes package
library("remotes") 
#in python: load numpy
import numpy as np

The “as np” section of the import in python is not required, but it makes using the package in python a lot simpler. This is because python tends to push users towards “object-oriented” programming style, whereas R lends itself naturally to a more “functional” programming style. We’ll discuss what this means later.
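As a tiny illustration of that stylistic difference: in python, operations are commonly methods attached to an object, while in R they are typically standalone functions applied to the object (the R equivalent is shown as a comment):

```python
# python: "object-oriented" style, the method hangs off the object
text = "statistics"
upper = text.upper()
print(upper)  # STATISTICS

# R: "functional" style, a standalone function takes the object
#   toupper("statistics")
```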

In R, packages are available primarily through CRAN. Some packages are available through the separate “Bioconductor” entity, and these must be installed differently (but similarly easily). The “remotes” package that we just installed and subsequently loaded, however, allows us to directly install packages that are on, e.g., github. Python has similar functionality within pip:
#in R: load the remotes package
library("remotes") 
#and install the "Rfits" package 
#from github user ASGR:
install_github("ASGR/Rfits")
#from the commandline
#pip install the django package
pip install git+https://github.com/django/django.git

Code Structure and Control Functions

An important difference between the python and R languages is the format of the code itself. Python imposes strict formatting requirements on the code, whereby blocks of code are linked together via leading whitespaces (i.e. indentation). R imposes no formatting restriction on codeblocks, instead using brackets. This means that R can seem overly verbose with brackets at times. As a demonstration, here is the formatting required for a standard set of control functions that are used in R and python: if, for, and while statements.

#Conditional statements in R
#If statements 
if (condition) { 
  #Evaluate if condition == true
} else { 
  #Evaluate if condition == false
}

#For statements 
for (var in sequence) { 
  #Evaluate 
}
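For comparison, the equivalent python blocks are delimited purely by indentation; there are no brackets to open or close (the condition and sequence below are stand-ins):

```python
# Conditional statements in python
condition = True
if condition:
    result = "condition was true"   # evaluated if condition is True
else:
    result = "condition was false"  # evaluated otherwise

# For statements
total = 0
for var in range(5):
    total += var  # evaluated once per element of the sequence

print(result, total)  # condition was true 10
```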